Repairing high-speed serial links

ABSTRACT

A method and system for repairing high-speed serial links are provided. The system includes a first electronic component connected to at least a second electronic component via at least one link. At least one of the first or second electronic components has a link controller. The link controller is configured to repair serial links by detecting a link error and mapping out individual lanes of a link where the link error is detected. The link controller resumes operation, i.e., transmission of data, and continues to monitor the lanes for errors. If and when additional link errors occur, the link controller identifies the lanes in which the link error occurs and deactivates those lanes. The deactivated lane(s) cannot be used in further transmissions, which, in turn, reduces the occurrence of intermittent link errors.

FIELD OF THE INVENTION

The present invention relates generally to the field of high-speed serial links, and more particularly to error detection and error handling in high-speed serial links.

BACKGROUND OF THE INVENTION

The following description of the background of the invention is provided simply as an aid in understanding the invention and is not admitted to describe or constitute prior art to the invention.

High-speed serial links (“HSSL”) are an increasingly popular method for interconnecting semiconductor components. HSSL may be used in the following technologies: fabric interconnect, memory interface (FBD), I/O interface (PCI-Express) and CPU connections (CIS). HSSL involve complex circuitry and can often run long distances within a system.

Link errors may occur on high-speed serial links, which can cause numerous problems including transmission failure. There are a large number of types of link errors including, but not limited to, a broken connection, an intermittent connection, a degraded connection, an incorrectly seated connector, system noise, soft errors in the physical block and hard errors in the physical block. Link errors can cause a number of problems including a continuous stream of bit errors, seemingly random errors, the occurrence of multiple intermittent bits or degraded bits, the failing of single or multiple bits, non-repeating single-channel errors and permanent or intermittent failures of a channel or link. Causes of link errors can include the gross failure of a connector, vibration, a cracked solder ball or trace, a corroded or contaminated connector, poor installation, disk or DRAM, radiation and ESD or latent defects. Accordingly, there are a large number of possible failure modes in high-speed serial links, with a correspondingly large number of observable symptoms. Thus, it is important to be able to isolate and correct link errors in an efficient manner.

Repairing links in real time while minimizing the risk of undetected transmission errors requires hardware to monitor error activity and invoke resilience mechanisms if too many errors occur. If a link is operating properly, the error rate should be zero. However, it is difficult to determine how to treat intermittent failures since they only occur periodically. Intermittent failures on links are also difficult to debug. Furthermore, certain types of multi-bit errors are not detected by conventional cyclic redundancy checks (CRC). Allowing too many intermittent errors to occur without corrective action to repair the link can result in undetected serious errors and silent data corruption (SDC).

In order to minimize the probability of SDC on links, link controllers must implement hardware to perform error handling. However, with intermittent errors it is difficult to identify individual lanes in a serial link that are causing the errors and need repair. Further, a link controller's capacity to handle errors is limited by the capacity of the error analysis engine and the rate at which the link controller might receive error information.

Accordingly, conventional systems address intermittent errors after their occurrence reaches a certain threshold. Unfortunately, several undetected errors can occur before repair begins, and waiting to repair a link at this stage has several negative consequences. Thus, there is a need for a system and method for efficiently detecting and repairing intermittent errors in high-speed serial links during operation.

SUMMARY OF THE INVENTION

According to one embodiment, a method for repairing serial links includes the steps of detecting a link error, mapping out individual lanes of a link where the link error is detected, resuming operation and monitoring the lanes for errors, identifying the lanes in which the link error occurs and deactivating the lanes in which the link error has occurred in order to reduce the occurrence of intermittent link errors.

According to another embodiment, a method for repairing serial links includes the steps of detecting a link error, retrying a transmission when the link error is detected if the number of link errors that have occurred exceeds a link error threshold, retraining the link by sparing out lanes in which the link error occurred if the link error occurs after retrying the transmission, reducing the frequency at which data is transmitted if the number of link errors that occur during retraining exceeds a retraining threshold, mapping out individual lanes of a link where the link error is detected, resuming operation and monitoring the lanes for errors, identifying the lanes in which the link error occurs and deactivating the lanes in which the link error has occurred in order to reduce the occurrence of intermittent link errors.

According to yet another embodiment, a system for repairing serial links includes a first electronic component, connected to at least a second electronic component via at least one link, wherein at least one of the first or second electronic components has a link controller. The link controller is configured to detect link errors, map out individual lanes when a link error is detected, monitor the lanes for errors, identify the lanes in which an error occurs and deactivate the lanes in which the error has occurred in order to reduce the occurrence of intermittent link errors.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the invention as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

Features, aspects and advantages of the present invention will become apparent from the following description, appended claims, and the accompanying exemplary embodiments shown in the drawings, which are briefly described below.

FIG. 1 is a block diagram of two ASICs connected via links according to one embodiment.

FIG. 2 is a flow chart illustrating a method for repairing a high-speed serial link according to one embodiment.

FIG. 3 is a flow chart illustrating a method for repairing a high-speed serial link according to another embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the present invention will be described below with reference to the accompanying drawings. It should be understood that the following description is intended to describe exemplary embodiments of the invention, and not to limit the invention.

FIG. 1 shows two application specific integrated circuits (ASICs) 10, having a link controller (LC) 40 and an error analysis engine (EAE) 50, connected via a plurality of links 20. Each link 20 can have multiple lanes 30. Between the two ASICs 10 there can be one or more lanes 30 of communication. Packets of data can be transmitted across each lane 30. Each lane 30 can represent a set of differential signal pairs where one is used for transmission and one is used for reception. It should be noted that the ASICs 10 can be any type of functional electric component or semiconductor component including but not limited to a processor, memory, a multiplexer, or a digital signal processor, etc.

The link controller 40 monitors and controls the links 20 connecting the ASICs 10. If the link controller 40 detects a link error, the link controller 40 can attempt to retransmit data using the link 20 and lane 30 that the link error was detected on. Errors are detected using cyclic redundancy checks (CRC). Each frame of transmitted data contains CRC bits which are compared at the source and destination device. If the CRC bits do not match, then an error is determined to have occurred. CRCs are useful in finding errors in single bits and single lanes 30. The larger the number of CRC bits, the greater the chance that the CRC bits will detect multiple bit errors. Further, the link controller 40 can track information concerning link errors, including but not limited to the number of errors that have occurred, the time of errors, etc.
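
The CRC comparison described above can be illustrated with a minimal sketch. It assumes a 32-bit CRC computed with Python's standard `zlib.crc32` and a frame layout of payload followed by the CRC bits; the description specifies neither, and the names `make_frame`, `frame_has_error` and `flip_bit` are hypothetical.

```python
import zlib

CRC_BYTES = 4  # assumption: 32-bit CRC; the description does not give a width


def make_frame(payload: bytes) -> bytes:
    """Source side: append CRC bits computed over the payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(CRC_BYTES, "big")


def frame_has_error(frame: bytes) -> bool:
    """Destination side: recompute the CRC and compare it with the received bits."""
    payload, received = frame[:-CRC_BYTES], frame[-CRC_BYTES:]
    expected = (zlib.crc32(payload) & 0xFFFFFFFF).to_bytes(CRC_BYTES, "big")
    return received != expected


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Test helper: simulate a single-bit error on the wire."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


frame = make_frame(b"packet data")
print(frame_has_error(frame))               # False: CRC bits match
print(frame_has_error(flip_bit(frame, 5)))  # True: a single flipped bit is caught
```

A mismatch only signals that the frame is bad; deciding what to do next (retry, retrain, and so on) is left to the repair levels described below.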

The link controller 40 implements at least four levels of repair/error handling in order to correct a detected link error. The first level of repair/error handling is to retry the transmission: the link controller 40 simply sends the data again. The second level of repair/error handling is re-training, i.e., the link controller 40 re-adjusts the expected arrival times of the data and fixes lane 30 failures by sparing out or eliminating transmission of data on the failed lanes. The third level of repair/error handling is frequency reduction. During frequency reduction, the link controller 40 reduces the frequency at which data is transmitted via a link 20. A fourth level of repair/error handling comprises mapping out lanes 30. During mapping, the link controller 40 maps out each lane 30 or a group of lanes 30 of each link 20. The link controller 40 monitors each lane 30 to determine whether an error occurs. If an error occurs, the link controller 40 makes note of which lane 30 the error occurred on so that that lane 30 and/or corresponding link 20 is not used in future communications. The link controller 40 is configured to automatically sequence through any one of the repair levels in any order.
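
As an illustration only, the four repair levels and the idea of sequencing through them in a configurable order might be modeled as follows. The enum `RepairLevel`, the function `sequence_repairs` and the `attempt` callback are hypothetical names introduced for this sketch; the description does not define a software interface for the link controller 40.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class RepairLevel(Enum):
    RETRY = 1             # first level: retry the transmission
    RETRAIN = 2           # second level: re-adjust timing, spare out failed lanes
    REDUCE_FREQUENCY = 3  # third level: lower the transmission frequency
    MAP_OUT_LANES = 4     # fourth level: map out lanes and monitor each one


def sequence_repairs(
    order: Iterable[RepairLevel],
    attempt: Callable[[RepairLevel], bool],
) -> Optional[RepairLevel]:
    """Try each level in the given order; return the first level that
    clears the error, or None if every level fails."""
    for level in order:
        if attempt(level):
            return level
    return None


# Usage: pretend only frequency reduction clears the error.
fixed_by = sequence_repairs(
    list(RepairLevel),
    lambda level: level is RepairLevel.REDUCE_FREQUENCY,
)
print(fixed_by)  # RepairLevel.REDUCE_FREQUENCY
```

Passing the order as an argument reflects the statement that the levels may be sequenced in any order; a fixed escalation from retry to lane mapping is just one choice.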

A method for implementing repair/error handling in high-speed serial links will now be described. As shown in FIGS. 2 and 3, the present method steps through resilience mechanisms for repairing links one at a time in order to minimize system-level impact and silent data corruption. Individual lanes 30 in each link 20 can be mapped out one at a time. During mapping, links 20 between devices are identified and the width (number of lanes 30) of a link 20 can be determined. A map of where and how data traffic will flow in each link 20 is generated. If an error occurs when a selected lane 30 is being used to transmit data, then the method selects a new mapped lane 30 for carrying out the data transmission.

As shown in FIG. 3, the method may begin with a system start or reset (Step 100). According to one embodiment, a method for detecting link errors comprises determining whether a link error has occurred (Steps 110, 220). If an error has occurred, the method maps out a lane 30 or group of lanes 30 (Step 230) to obtain one or more mapped lanes and continues operation. If an error is detected on one of the mapped lanes (Step 240), the mapped lane 30 is determined to be a failure and is identified by the link controller 40 as a lane 30 that should not be used in further communications (Step 250). The method deactivates the lane 30 and/or link 20 to prevent the occurrence of additional intermittent errors (Step 260).
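
The map, monitor, identify and deactivate steps above can be sketched in software as follows. This is a simplified model under stated assumptions: lanes are numbered from 0, "mapping out" is treated as visiting each active lane in turn, and the per-lane error check is supplied by the caller. The `Link` class and `isolate_failed_lanes` function are hypothetical and do not come from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Link:
    """Hypothetical model of a link 20 with `width` lanes 30."""
    width: int
    deactivated: Set[int] = field(default_factory=set)

    def active_lanes(self) -> List[int]:
        return [lane for lane in range(self.width) if lane not in self.deactivated]


def isolate_failed_lanes(link: Link, lane_has_error: Callable[[int], bool]) -> List[int]:
    """Visit each active lane, record the lanes that show an error,
    and deactivate them so later transmissions avoid them."""
    failed = []
    for lane in link.active_lanes():    # Step 230: map out a lane
        if lane_has_error(lane):        # Step 240: error on the mapped lane?
            failed.append(lane)         # Step 250: identify the lane as failed
            link.deactivated.add(lane)  # Step 260: deactivate it
    return failed


link = Link(width=4)
print(isolate_failed_lanes(link, lambda lane: lane == 2))  # [2]
print(link.active_lanes())                                 # [0, 1, 3]
```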

According to another embodiment, a method for link error handling employs the above-mentioned levels of error handling in a tiered step-by-step approach. As shown in FIG. 2, first, the method executes a CRC check (Step 110).

If the number of CRC errors exceeds a certain threshold (Step 120), then the method attempts retrying the data transmission (Step 130). According to one embodiment, the method executes an automatic retry upon detecting a CRC failure. According to one embodiment, if the retry fails, then the method attempts to retrain the link 20 or lane 30 upon which the error was detected (Step 130). According to still another embodiment, the method retrains the links 20 or lanes 30 and retries the transmission in a single step (Step 130). If the number of errors that occur during retraining exceeds a certain threshold (Step 140), then the method determines if frequency reduction is possible (Step 150).

As shown in FIG. 2, if frequency reduction is possible, the method reduces the frequency of the transmission, executes a retrain/retry (Step 160) and then performs a CRC check (Step 110).

According to one embodiment, if frequency reduction is not possible (Step 150) and if there exist additional links 20 or link groups that can be used for transmission (Step 170), the method maps out a next lane 30, link 20 or group of links 20 (Step 180). Once a lane 30 or link 20 is mapped, it is retrained (Step 185) and a CRC check is executed (Step 110). The method may execute the above-described process for each lane 30, link 20 or group of links 20 that is mapped. If a lane 30, link 20 or group of links 20 fails the above-mentioned resiliency tests and there are no more links 20 or lanes 30 to be mapped, then the method determines that the link 20 has failed (Step 190). Once failed lanes 30 or links 20 are identified, the link controller 40 remembers the failed link or lane and can execute transmissions without using the failed lane 30 or link 20, which, in turn, will reduce the occurrence of intermittent errors.
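
The tiered flow of FIG. 2 can be approximated by the following sketch. It is a toy model, not the claimed hardware: `SimulatedLink` is a hypothetical stand-in in which traffic uses one lane at a time, "bad" lanes always produce CRC errors, "marginal" lanes produce errors only at full frequency, and mapping out a next lane (Step 180) is modeled as moving to the next lane index. A pass at Step 140 is treated as a return to normal operation.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class SimulatedLink:
    """Hypothetical stand-in for a link 20; not part of the description."""
    width: int
    bad_lanes: FrozenSet[int] = frozenset()       # fail at any frequency
    marginal_lanes: FrozenSet[int] = frozenset()  # fail only at full frequency
    reductions_left: int = 1                      # frequency steps still available
    reduced: bool = False
    lane: int = 0                                 # lane currently carrying traffic
    failed_lanes: List[int] = field(default_factory=list)

    def crc_errors(self, transfers: int = 8) -> int:
        faulty = self.lane in self.bad_lanes or (
            self.lane in self.marginal_lanes and not self.reduced
        )
        return transfers if faulty else 0


def tiered_repair(link: SimulatedLink, error_threshold: int = 0,
                  retrain_threshold: int = 0) -> List[str]:
    """Walk the tiers and return a trace of the actions taken."""
    trace = []
    while True:
        if link.crc_errors() <= error_threshold:    # Steps 110 and 120
            trace.append("operate")
            return trace
        trace.append("retry/retrain")               # Step 130
        if link.crc_errors() <= retrain_threshold:  # Step 140
            trace.append("operate")
            return trace
        if link.reductions_left > 0:                # Step 150
            link.reductions_left -= 1
            link.reduced = True
            trace.append("reduce frequency")        # Step 160, then back to Step 110
            continue
        link.failed_lanes.append(link.lane)         # remember the failed lane
        if link.lane + 1 < link.width:              # Step 170
            link.lane += 1                          # Steps 180 and 185
            trace.append(f"map lane {link.lane}")
            continue
        trace.append("link failed")                 # Step 190
        return trace


print(tiered_repair(SimulatedLink(width=2, bad_lanes=frozenset({0}))))
# ['retry/retrain', 'reduce frequency', 'retry/retrain', 'map lane 1', 'operate']
```

Each pass through the loop either returns, uses up a frequency step, or advances to the next lane, so the sketch always terminates.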

Each threshold (120, 140) can be determined by a counter. The counter amount can be customized based on design preferences. Further, after a predetermined length of operation (several days, weeks, etc.) the method can allow the EAE 50 to reset the counters so that a correctional step is not taken in response to a cumulative number of errors that have occurred over an acceptable length of time.
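
A threshold counter with a periodic reset might look like the sketch below. It assumes timestamps are supplied by the caller in seconds and that the reset is applied lazily when the next error arrives; in the description the reset is performed by the EAE 50, and the class name `ThresholdCounter` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThresholdCounter:
    threshold: int       # customizable count that triggers a corrective step
    reset_after: float   # length of operation (seconds) after which the count clears
    count: int = 0
    window_start: float = 0.0

    def record_error(self, now: float) -> bool:
        """Count one error; return True when the threshold is exceeded."""
        if now - self.window_start >= self.reset_after:
            # Old errors spread over an acceptable length of time are forgotten.
            self.count = 0
            self.window_start = now
        self.count += 1
        return self.count > self.threshold


counter = ThresholdCounter(threshold=2, reset_after=100.0)
print([counter.record_error(t) for t in (1.0, 2.0, 3.0)])  # [False, False, True]
print(counter.record_error(150.0))                          # False: counter was reset
```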

The above-described system and method have several advantages. The method has a very fast response time to link errors and therefore minimizes the risk of undetected errors because the total number of errors before repair will be small. In turn, this minimizes the amount of link degradation that can occur. Further, having the hardware count error occurrences and automatically sequence through the resilience mechanisms (including mapping out lanes) means that any intermittent error will quickly be repaired, or (if necessary) the link shut down, before the intermittent error can cause more serious performance flaws.

The foregoing description of a preferred embodiment of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teaching or may be acquired from practice of the invention. The embodiment was chosen and described in order to explain the principles of the invention and as a practical application to enable one skilled in the art to utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

1. A method for repairing serial links, comprising: detecting a link error; mapping out individual lanes of a link where the link error is detected; monitoring the lanes for errors during operation; identifying the lanes in which the link error occurs; and deactivating the lanes in which the link error has occurred in order to reduce the occurrence of intermittent link errors.
2. The method for repairing serial links as claimed in claim 1, further comprising retrying a transmission when a link error is detected if the number of link errors that have occurred exceeds a link error threshold.
3. The method for repairing serial links as claimed in claim 1, further comprising automatically retrying a transmission when a link error is detected.
4. The method for repairing serial links as claimed in claim 1, further comprising retraining the link by sparing out lanes in which an error occurs when a link error is detected.
5. The method for repairing serial links as claimed in claim 4, further comprising reducing the frequency at which data is transmitted if the number of link errors that occur during retraining exceeds a retraining threshold.
6. The method for repairing serial links as claimed in claim 1, wherein the link errors are detected using cyclic redundancy checks.
7. A method for repairing serial links, comprising: detecting a link error; retrying a transmission when the link error is detected if the number of link errors that have occurred exceeds a link error threshold; retraining the link by sparing out lanes in which the link error occurred if the link error occurs after retrying the transmission; reducing the frequency at which data is transmitted if the number of link errors that occur during retraining exceeds a retraining threshold; identifying remaining links or groups of links for use in the transmission if frequency reduction is not possible; mapping out individual lanes of a link, if there are remaining links or groups of links; retraining the individual lanes; monitoring the lanes for errors during operation; and identifying the link in which the link error occurs, if there are no remaining links or groups of links to be mapped.
8. A system for repairing serial links, comprising: a first electronic component, connected to at least a second electronic component via at least one link, wherein at least one of the first or second electronic components has a link controller, the link controller configured to: detect link errors; map out individual lanes when a link error is detected; monitor the lanes for errors during operation; identify the lanes in which an error occurs; and deactivate the lanes in which the error has occurred in order to reduce the occurrence of intermittent link errors.
9. The system for repairing serial links as claimed in claim 8, wherein at least one of the first or second electronic components further comprises an error analysis engine.
10. The system for repairing serial links as claimed in claim 8, wherein at least one of the first or second electronic components can be an application specific integrated circuit.
11. The system for repairing serial links as claimed in claim 8, wherein the link controller can track information concerning link errors including the number of link error occurrences and a time at which the link error occurred.
12. The system for repairing serial links as claimed in claim 8, wherein the link controller can retry a transmission when a link error is detected.
13. The system for repairing serial links as claimed in claim 8, wherein the link controller can retrain a link by sparing out failed lanes when a link error is detected.
14. The system for repairing serial links as claimed in claim 8, wherein the link controller can reduce the frequency at which data is transmitted when a link error is detected.